1 Goal

  1. Become familiar with the basic predictive machine learning landscape
  2. Appropriately preprocess the UCI heart disease dataset
  3. Fit a single model to train/test sets and evaluate performance
  4. Deploy a cross-validated ensemble of multiple algorithms for simultaneous comparison

2 Introduction: What is machine learning?

Machine learning combines statistics and computer science, and is often defined as an algorithm or computer that can learn on its own as it is given more data, without explicit human input. This definition easily becomes conflated with opaque ideas of “artificial intelligence” and is frequently assumed to apply only to research in transportation, robotics, cognition, and human-computer interaction.

We thankfully do not need to think of machine learning in such complicated terms. Instead, it can be thought of as a toolbox for exploring and modeling data for application to research problems in virtually all fields of study. Is there something you want to predict or infer? Do you want to explore the structure of a dataset? If so, there is a good chance that machine learning might be able to help.

We can think of machine learning as a

“vast set of tools for understanding data. These tools can be classified as supervised or unsupervised. Broadly speaking, supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs… With unsupervised statistical learning, there are inputs but no supervising output; nevertheless we can learn relationships and structure from such data.” (James et al. 2021, p. 1).

James G, Witten D, Hastie T, Tibshirani R. 2021. An Introduction to Statistical Learning: With Applications in R, 2nd edition. New York: Springer.

2.1 Machine learning terminology

You neither need to understand the mathematics behind machine learning nor get lost in romanticized notions of “AI” to responsibly apply it to your research problem. However, you should be familiar with the basic vocabulary and jargon so you can understand how the different parts work together and explain their application to your research.

Machine learning tasks are generally defined as belonging to one of a small handful of groups: supervised, unsupervised, semi-supervised, deep, reinforcement, and targeted/causal.

Our focus in this workshop is on supervised learning for a classification task using the SuperLearner R package. We will follow the Guide to SuperLearner.

  1. Supervised learning: is a predictive technique that trains a model on known, labeled data. The goal is to understand the relationships between variables in a dataset so that the trained model can be used to make predictions on new data it has not yet seen and whose labels are unknown. Or, as James et al. put it:

“We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference).” (James et al. 2021, p. 26)

  • Classification: is the supervised task when the Y variable is discrete/categorical (\(Pr(y=1|x)\)). For binary classification, 1 is the “yes”, or positive, class and 0 is the “no”, or negative, class. This extends to multiclass problems as well.

  • Regression: is the supervised task when the Y variable is continuous (i.e., numeric or integer) (\(E(y|x)\)).

  2. Supervised syntax

\(Y \sim X_1 + X_2 + X_3 + \dots + X_n\)

Simply put, we want to use the X variables to predict Y.

  3. X and Y variables:
  • X is/are the independent variable(s) that we use to do the predicting. These are also referred to as features, covariates, predictors, and explanatory/input variables.

  • Y is the dependent variable and the one we want to predict. It is also referred to as the outcome, target, or response variable. Although predicting the label itself might be convenient, predicting the class probabilities is more informative.

  • A machine learning function might look like this:

\(y=f(x)+ϵ\)

where f is the function that estimates the relationship between X and Y. The random error epsilon (\(\epsilon\)) is independent of x and averages to zero. We can therefore use \(\hat{y} = \hat{f}(x)\) to predict Y for new data (the test dataset) and evaluate how well the algorithm learned the target function when trained and then introduced to the new data.
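As a toy illustration (simulated data, not the heart dataset; all object names here are hypothetical), we can estimate \(\hat{f}\) with a linear model and check its predictions on held-out observations:

set.seed(1)
x <- runif(100)
y <- 2 * x + rnorm(100, sd = 0.2)  # the epsilon term averages to zero
toy <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = toy[1:70, ])           # estimate f-hat on a training portion
preds <- predict(fit, newdata = toy[71:100, ]) # predict on unseen rows
mean((toy$y[71:100] - preds)^2)                # test-set mean squared error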

  4. Data splitting: Predictions are usually evaluated twice:

  • first on the labeled training set to see how well the model can learn the relationships between the X and Y variables, and then

  • on the test set to see how well the trained model can generalize to predicting on new data.

To accomplish this task, a given dataset is divided into training and test sets.

NOTE: a validation set is also sometimes used for hyperparameter tuning/model selection within the training dataset. While parameters represent the internal configuration of the model and are learned during training, hyperparameters are defined by the user before training begins.

  • The training set: generally consists of the majority portion of the original dataset (70%, for example) where the model can learn the relationships between the X and Y variables.

    • Hyperparameters: are the configurations manually set by the programmer prior to model training, chosen through heuristics, trial and error, or a grid search. We do not know the optimal combination of these options in advance, and must tune the hyperparameters through the model training process to optimize prediction performance.
  • The test set: consists of the remaining portion of the dataset (30% in this example) that the trained model will then try to predict without seeing the Y labels.

  • k-fold cross-validation: is a preferred method for approaching the data splitting process because it repeats the train/test split “k” number of times and rotates portions of the dataset so that each observation appears in the test set exactly once.
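As a minimal base R sketch (the names folds, k, and n are illustrative), 5-fold assignment can look like this, with each observation landing in the held-out fold exactly once:

set.seed(1)
k <- 5
n <- 100
folds <- sample(rep(1:k, length.out = n)) # random fold label for each observation

for (i in 1:k) {
  test_idx  <- which(folds == i)  # the held-out fold
  train_idx <- which(folds != i)  # the remaining k - 1 folds
  # fit on train_idx, evaluate on test_idx ...
}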

  5. Performance metrics: It is necessary to evaluate model performance on the training and test sets (and validation set, when applicable) through a variety of confusion matrix derivations.

We will focus on two classification metrics in this workshop:

  1. Risk - as measured by mean squared error
  2. AUC-ROC - area under the receiver operating characteristic curve; a measure of the true positive rate versus the false positive rate

Keep in mind that misclassification error is often an inappropriate performance metric, particularly when dealing with a Y variable whose distribution is imbalanced.
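To see why, consider a hypothetical outcome where 95% of observations belong to the negative class; a useless model that always predicts 0 still achieves 95% accuracy:

y_true <- c(rep(0, 95), rep(1, 5)) # imbalanced outcome
y_pred <- rep(0, 100)              # degenerate "always no" classifier

mean(y_pred == y_true)                     # accuracy is 0.95, yet no positives are found
table(predicted = y_pred, actual = y_true)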

  • A model is underfit if it cannot learn the relationships between the X and Y variables on the training set.

  • A model is overfit if it adequately learns the relationships between the X and Y variables and performs well on the training set, but performs poorly on the test set.

2.2 Example machine learning workflow

  1. Read literature in your field to understand what has been done; annotate what you read
  2. Formulate research question(s)
  3. Obtain data
  4. Preprocess data
  • scale variables when necessary
  • handle missing values if present (listwise delete, median impute, multiple impute, generalized low rank model impute, etc.)
  • if present, convert factor variables to numeric indicators
  5. Define x and y variables
  6. Split data into train and test sets
  7. Train and evaluate the performance of a single algorithm as a prototype
  8. Examine the trained model’s performance on the test set
  9. Create and deploy a cross-validated ensemble

NOTE: with practice, you might find you prefer to perform the cross-validated ensemble step first, followed by a single train/test split.

3 Today’s topic: Predicting heart disease

We investigate how well individual algorithms and a SuperLearner ensemble weighted average can predict heart disease (yes/no) using other health indicators as predictors, called features. Learn more about heart disease at https://theheartfoundation.org/heart-disease-facts-2/ and about the dataset at the UCI Machine Learning Repository.

4 SuperLearner

SuperLearner is an algorithm that uses cross-validation to estimate the performance of multiple machine learning models, or the same model with different settings. It then creates an optimal weighted average of those models, aka an “ensemble”, using the test data performance. This approach has been proven to be asymptotically as accurate as the best possible prediction algorithm that is tested. Guide to SuperLearner - 1 Background

In this manner, an ensemble is a powerful tool because it:

  1. Eliminates bias of single algorithm selection for framing a research problem

  2. Allows for comparison of multiple algorithms, and/or comparison of the same model(s) but tuned in many different ways

  3. Utilizes a second-level algorithm that produces an ideal weighted prediction that is suitable for data of virtually all distributions and uses external cross-validation to prevent overfitting.

Read the papers:

5 Setup

5.1 Package installation

Install and load the packages we will use in this workshop. The pacman package management tool handles everything for you in the code chunk below. You should consider using it (or something like it) for your own research. Read the documentation here.

if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load("caret",         # create stratified random split of the dataset
               "ck37r",         # Chris Kennedy's Machine Learning Helper Toolkit
               "ggplot2",       # visualize risk and ROC curve
               "glmnet",        # elastic net algorithm
               "ranger",        # random forest algorithm
               "rpart",         # decision tree algorithm
               "scales",        # for scaling data
               "SuperLearner",  # fit individual algorithms and ensembles
               "ROCR",          # compute AUC-ROC performance
               "xgboost")       # boosted tree algorithm

5.1.1 The ck37r package

If the CRAN installation for the ck37r package does not work correctly, install it from GitHub by uncommenting and running the three lines of code below.

If prompted to update packages, select “All” by entering a number 1 in your console and press the Enter key.

# install.packages("remotes")
# remotes::install_github("ck37/ck37r")
# library(ck37r)

6 Import and preprocess the data

Import and preprocess the data in seven steps, creating a new variable at each of the first six:

  1. Import the dataset: The variable raw contains data from the raw .csv file.

  2. Convert categorical variables: raw_fac is used to convert categorical variables to factor type.

  3. Convert factors to numeric indicators: raw_df will be for conversion of factors to numeric indicators.

  4. Identify and remove variables to be scaled: removed consists of the preprocessed data excluding the continuous variables to be scaled.

  5. Scale the continuous variables: rescaled is the variable for the scaled variables.

  6. Produce clean dataset: clean is the final merged dataset; a combination of the variables removed and rescaled.

  7. Save the clean dataset: using the save function. You can load it with the load function so you do not have to repeat these preprocessing steps again.

6.1 1. Import the dataset

Read in the raw .csv file and save it in a variable named raw.

raw <- read.csv("data/raw/heart.csv")
str(raw)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp      : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ target  : int  1 1 1 1 1 1 1 1 1 1 ...

6.2 2. Convert categorical variables

The variables cp, restecg, slope, ca, and thal need to be converted to nominal categorical (factor) type.

Save this factorized version in a variable named raw_fac.

raw_fac <- ck37r::categoricals_to_factors(data = raw, 
                                          categoricals = c("cp", "restecg", "slope", "ca", "thal"))
str(raw_fac)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp      : Factor w/ 4 levels "0","1","2","3": 4 3 2 2 1 1 2 2 3 3 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : Factor w/ 3 levels "0","1","2": 1 2 1 2 2 2 1 2 2 2 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : Factor w/ 3 levels "0","1","2": 1 1 3 3 3 2 2 3 3 3 ...
##  $ ca      : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ thal    : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 3 2 3 4 4 3 ...
##  $ target  : int  1 1 1 1 1 1 1 1 1 1 ...

6.3 3. Convert factors to numeric indicators

Most machine learning algorithms require that input variables are numeric and therefore do not handle factor data well (with some exceptions, such as decision trees).

Therefore, we want to convert the factor variables to numeric indicator variables (aka one-hot/dummy coding). See a few examples here.

Name this variable raw_df.

# Save this as an intermediate variable named raw_ind 
raw_ind <- ck37r::factors_to_indicators(data = raw_fac, 
                                        verbose = TRUE)
## Converting factors (5): cp, restecg, slope, ca, thal
## Converting cp from a factor to a matrix (4 levels).
## : cp_1 cp_2 cp_3 
## Converting restecg from a factor to a matrix (3 levels).
## : restecg_1 restecg_2 
## Converting slope from a factor to a matrix (3 levels).
## : slope_1 slope_2 
## Converting ca from a factor to a matrix (5 levels).
## : ca_1 ca_2 ca_3 ca_4 
## Converting thal from a factor to a matrix (4 levels).
## : thal_1 thal_2 thal_3 
## Combining factor matrices into a data frame.
# Extract the actual data portion from raw_ind and save as raw_df
raw_df <- raw_ind$data
str(raw_df)
## 'data.frame':    303 obs. of  23 variables:
##  $ age      : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex      : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ trestbps : int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol     : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs      : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ thalach  : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang    : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak  : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ target   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ cp_1     : int  0 0 1 1 0 0 1 1 0 0 ...
##  $ cp_2     : int  0 1 0 0 0 0 0 0 1 1 ...
##  $ cp_3     : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ restecg_1: int  0 1 0 1 1 1 0 1 1 1 ...
##  $ restecg_2: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ slope_1  : int  0 0 0 0 0 1 1 0 0 0 ...
##  $ slope_2  : int  0 0 1 1 1 0 0 1 1 1 ...
##  $ ca_1     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_2     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_3     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_4     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal_1   : int  1 0 0 0 0 1 0 0 0 0 ...
##  $ thal_2   : int  0 1 1 1 1 0 1 0 0 1 ...
##  $ thal_3   : int  0 0 0 0 0 0 0 1 1 0 ...

6.4 4. Identify and remove variables to be scaled

Identify and remove the variables age, trestbps, chol, thalach, and oldpeak to be scaled.

# First, investigate the data to identify variables to scale
summary(raw_df)
##       age             sex            trestbps          chol      
##  Min.   :29.00   Min.   :0.0000   Min.   : 94.0   Min.   :126.0  
##  1st Qu.:47.50   1st Qu.:0.0000   1st Qu.:120.0   1st Qu.:211.0  
##  Median :55.00   Median :1.0000   Median :130.0   Median :240.0  
##  Mean   :54.37   Mean   :0.6832   Mean   :131.6   Mean   :246.3  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:140.0   3rd Qu.:274.5  
##  Max.   :77.00   Max.   :1.0000   Max.   :200.0   Max.   :564.0  
##       fbs            thalach          exang           oldpeak    
##  Min.   :0.0000   Min.   : 71.0   Min.   :0.0000   Min.   :0.00  
##  1st Qu.:0.0000   1st Qu.:133.5   1st Qu.:0.0000   1st Qu.:0.00  
##  Median :0.0000   Median :153.0   Median :0.0000   Median :0.80  
##  Mean   :0.1485   Mean   :149.6   Mean   :0.3267   Mean   :1.04  
##  3rd Qu.:0.0000   3rd Qu.:166.0   3rd Qu.:1.0000   3rd Qu.:1.60  
##  Max.   :1.0000   Max.   :202.0   Max.   :1.0000   Max.   :6.20  
##      target            cp_1            cp_2             cp_3        
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :1.0000   Median :0.000   Median :0.0000   Median :0.00000  
##  Mean   :0.5446   Mean   :0.165   Mean   :0.2871   Mean   :0.07591  
##  3rd Qu.:1.0000   3rd Qu.:0.000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.000   Max.   :1.0000   Max.   :1.00000  
##    restecg_1        restecg_2         slope_1         slope_2      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :1.0000   Median :0.0000   Median :0.000   Median :0.0000  
##  Mean   :0.5017   Mean   :0.0132   Mean   :0.462   Mean   :0.4686  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.000   Max.   :1.0000  
##       ca_1             ca_2             ca_3              ca_4       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.2145   Mean   :0.1254   Mean   :0.06601   Mean   :0.0165  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
##      thal_1            thal_2           thal_3      
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.00000   Median :1.0000   Median :0.0000  
##  Mean   :0.05941   Mean   :0.5479   Mean   :0.3861  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000
summary(raw_df[,c("age", "trestbps", "chol", "thalach", "oldpeak")])
##       age           trestbps          chol          thalach         oldpeak    
##  Min.   :29.00   Min.   : 94.0   Min.   :126.0   Min.   : 71.0   Min.   :0.00  
##  1st Qu.:47.50   1st Qu.:120.0   1st Qu.:211.0   1st Qu.:133.5   1st Qu.:0.00  
##  Median :55.00   Median :130.0   Median :240.0   Median :153.0   Median :0.80  
##  Mean   :54.37   Mean   :131.6   Mean   :246.3   Mean   :149.6   Mean   :1.04  
##  3rd Qu.:61.00   3rd Qu.:140.0   3rd Qu.:274.5   3rd Qu.:166.0   3rd Qu.:1.60  
##  Max.   :77.00   Max.   :200.0   Max.   :564.0   Max.   :202.0   Max.   :6.20
# Save this as an intermediate variable named to_scale
to_scale <- raw_df[,c("age", "trestbps", "chol", "thalach", "oldpeak")]

Also see ?ck37r::standardize for a simple extension.

Remove the variables to be scaled and save the remaining variables in a dataframe named removed.

removed <- raw_df[,-c(1, # age
                      3, # trestbps
                      4, # chol
                      6, # thalach
                      8  # oldpeak
                      )]
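Numeric column positions are fragile if the column order ever changes. An equivalent, name-based removal (assuming the same five column names; removed_by_name is a hypothetical variable) would be:

# Drop the to-be-scaled columns by name rather than by position
removed_by_name <- raw_df[, setdiff(names(raw_df),
                                    c("age", "trestbps", "chol", "thalach", "oldpeak"))]
identical(removed, removed_by_name) # should be TRUE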

6.5 5. Scale the continuous variables

Rescale these variables to a range of 0 to 1 in a variable named rescaled.

rescaled <- as.data.frame(rescale(as.matrix(to_scale), to = c(0, 1)))
summary(rescaled)
##       age             trestbps           chol           thalach      
##  Min.   :0.05142   Min.   :0.1667   Min.   :0.2234   Min.   :0.1259  
##  1st Qu.:0.08422   1st Qu.:0.2128   1st Qu.:0.3741   1st Qu.:0.2367  
##  Median :0.09752   Median :0.2305   Median :0.4255   Median :0.2713  
##  Mean   :0.09639   Mean   :0.2334   Mean   :0.4366   Mean   :0.2653  
##  3rd Qu.:0.10816   3rd Qu.:0.2482   3rd Qu.:0.4867   3rd Qu.:0.2943  
##  Max.   :0.13652   Max.   :0.3546   Max.   :1.0000   Max.   :0.3582  
##     oldpeak        
##  Min.   :0.000000  
##  1st Qu.:0.000000  
##  Median :0.001418  
##  Mean   :0.001843  
##  3rd Qu.:0.002837  
##  Max.   :0.010993
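Note that rescale() applied to a whole matrix uses the single global minimum and maximum, which is why the column summaries above do not each span 0 to 1. If you instead want every variable scaled to [0, 1] by its own range, one possible sketch (rescaled_cols is a hypothetical name) is:

# Rescale each column by its own min/max instead of the global range
rescaled_cols <- as.data.frame(apply(to_scale, 2, rescale, to = c(0, 1)))
summary(rescaled_cols) # each column now runs from 0 to 1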

6.6 6. Produce clean dataset

Finally, recombine the original data with the scaled variables for the final clean dataset.

clean <- cbind(removed,       # cleaned data without the scaled variables
               rescaled       # scaled variables
               )

# The scaled variables are the last five!
str(clean)
## 'data.frame':    303 obs. of  23 variables:
##  $ sex      : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ fbs      : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ exang    : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ target   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ cp_1     : int  0 0 1 1 0 0 1 1 0 0 ...
##  $ cp_2     : int  0 1 0 0 0 0 0 0 1 1 ...
##  $ cp_3     : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ restecg_1: int  0 1 0 1 1 1 0 1 1 1 ...
##  $ restecg_2: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ slope_1  : int  0 0 0 0 0 1 1 0 0 0 ...
##  $ slope_2  : int  0 0 1 1 1 0 0 1 1 1 ...
##  $ ca_1     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_2     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_3     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_4     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal_1   : int  1 0 0 0 0 1 0 0 0 0 ...
##  $ thal_2   : int  0 1 1 1 1 0 1 0 0 1 ...
##  $ thal_3   : int  0 0 0 0 0 0 0 1 1 0 ...
##  $ age      : num  0.1117 0.0656 0.0727 0.0993 0.1011 ...
##  $ trestbps : num  0.257 0.23 0.23 0.213 0.213 ...
##  $ chol     : num  0.413 0.443 0.362 0.418 0.628 ...
##  $ thalach  : num  0.266 0.332 0.305 0.316 0.289 ...
##  $ oldpeak  : num  0.00408 0.00621 0.00248 0.00142 0.00106 ...

6.7 7. Save the clean dataset

Save the clean dataset so you do not have to repeat these preprocessing steps.

save(clean,                                  # variable(s) to be saved
     file = "data/preprocessed/clean.RData") # file location and name

7 Wipe your global environment clean

Reload the clean dataset with:

load("data/preprocessed/clean.RData")
str(clean)
## 'data.frame':    303 obs. of  23 variables:
##  $ sex      : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ fbs      : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ exang    : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ target   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ cp_1     : int  0 0 1 1 0 0 1 1 0 0 ...
##  $ cp_2     : int  0 1 0 0 0 0 0 0 1 1 ...
##  $ cp_3     : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ restecg_1: int  0 1 0 1 1 1 0 1 1 1 ...
##  $ restecg_2: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ slope_1  : int  0 0 0 0 0 1 1 0 0 0 ...
##  $ slope_2  : int  0 0 1 1 1 0 0 1 1 1 ...
##  $ ca_1     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_2     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_3     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_4     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal_1   : int  1 0 0 0 0 1 0 0 0 0 ...
##  $ thal_2   : int  0 1 1 1 1 0 1 0 0 1 ...
##  $ thal_3   : int  0 0 0 0 0 0 0 1 1 0 ...
##  $ age      : num  0.1117 0.0656 0.0727 0.0993 0.1011 ...
##  $ trestbps : num  0.257 0.23 0.23 0.213 0.213 ...
##  $ chol     : num  0.413 0.443 0.362 0.418 0.628 ...
##  $ thalach  : num  0.266 0.332 0.305 0.316 0.289 ...
##  $ oldpeak  : num  0.00408 0.00621 0.00248 0.00142 0.00106 ...

Separate scripts: All steps up to this point could easily have occurred in a separate script. Good scripting practices - and efficient project management - are encouraged for any machine learning project. Contact SSDS if you want to learn more.

8 Machine learning in practice: How to stay organized

One handy way to keep your machine learning workflow organized is to store all the components in a list. Let’s call our list heart_class, for our task of classifying heart disease as either present or absent.

Our first two list components will be the dataset and outcome variable. In the heart dataset, the target variable is coded with a 1 if the patient had heart disease and a 0 if they did not.

heart_class <- list(
  data = clean, 
  outcome = "target"
)

# View the list contents
names(heart_class)
## [1] "data"    "outcome"
# Access part of the list with the dollar sign
head(heart_class$data)
##   sex fbs exang target cp_1 cp_2 cp_3 restecg_1 restecg_2 slope_1 slope_2 ca_1
## 1   1   1     0      1    0    0    1         0         0       0       0    0
## 2   1   0     0      1    0    1    0         1         0       0       0    0
## 3   0   0     0      1    1    0    0         0         0       0       1    0
## 4   1   0     0      1    1    0    0         1         0       0       1    0
## 5   0   0     1      1    0    0    0         1         0       0       1    0
## 6   1   0     0      1    0    0    0         1         0       1       0    0
##   ca_2 ca_3 ca_4 thal_1 thal_2 thal_3        age  trestbps      chol   thalach
## 1    0    0    0      1      0      0 0.11170213 0.2570922 0.4131206 0.2659574
## 2    0    0    0      0      1      0 0.06560284 0.2304965 0.4432624 0.3315603
## 3    0    0    0      0      1      0 0.07269504 0.2304965 0.3617021 0.3049645
## 4    0    0    0      0      1      0 0.09929078 0.2127660 0.4184397 0.3156028
## 5    0    0    0      0      1      0 0.10106383 0.2127660 0.6276596 0.2890071
## 6    0    0    0      1      0      0 0.10106383 0.2482270 0.3404255 0.2624113
##        oldpeak
## 1 0.0040780142
## 2 0.0062056738
## 3 0.0024822695
## 4 0.0014184397
## 5 0.0010638298
## 6 0.0007092199
heart_class$outcome
## [1] "target"

8.1 Add covariates

# setdiff is a handy base R function to define the covariates as all variables *except* the outcome
heart_class$covariates <- setdiff(names(heart_class$data), heart_class$outcome)

# The covariate names appear in the heart_class list!
names(heart_class)
## [1] "data"       "outcome"    "covariates"
# Print the covariates (notice that "target" is excluded)
heart_class$covariates
##  [1] "sex"       "fbs"       "exang"     "cp_1"      "cp_2"      "cp_3"     
##  [7] "restecg_1" "restecg_2" "slope_1"   "slope_2"   "ca_1"      "ca_2"     
## [13] "ca_3"      "ca_4"      "thal_1"    "thal_2"    "thal_3"    "age"      
## [19] "trestbps"  "chol"      "thalach"   "oldpeak"

8.2 Add training rows

# ?createDataPartition
set.seed(1) 
training_rows <- 
  caret::createDataPartition(heart_class$data[[heart_class$outcome]], 
                             p = 0.70, 
                             list = FALSE)

# assign the row indices
heart_class$train_rows <- training_rows

# We have added the training rows to our list
names(heart_class)
## [1] "data"       "outcome"    "covariates" "train_rows"
head(heart_class$train_rows)
##      Resample1
## [1,]         1
## [2,]         2
## [3,]         4
## [4,]         5
## [5,]         6
## [6,]         9

8.3 Split data into training and test sets

# use the train_rows indices to add the correct covariates and outcome to the training set
x_train <- heart_class$data[heart_class$train_rows, heart_class$covariates]
y_train <- heart_class$data[heart_class$train_rows, heart_class$outcome]

# note the use of the minus sign here:
# rows whose indices are *not* in train_rows form the test set
x_test <- heart_class$data[-heart_class$train_rows, heart_class$covariates]
y_test <- heart_class$data[-heart_class$train_rows, heart_class$outcome]

# Mean and frequencies of training set
mean(heart_class$data[training_rows, "target"])
## [1] 0.5446009
table(heart_class$data[training_rows, "target"])
## 
##   0   1 
##  97 116
# Mean and frequencies of test set
mean(heart_class$data[-training_rows, "target"])
## [1] 0.5444444
table(heart_class$data[-training_rows, "target"])
## 
##  0  1 
## 41 49
# Add these variables to our list for safekeeping
heart_class$x_train <- x_train
heart_class$y_train <- y_train
heart_class$x_test <- x_test
heart_class$y_test <- y_test
names(heart_class)
## [1] "data"       "outcome"    "covariates" "train_rows" "x_train"   
## [6] "y_train"    "x_test"     "y_test"
# For example,
heart_class$y_train
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [149] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# Also save X and Y for the ensemble, since we will use all of the data in the cross-validation process
heart_class$Y <- clean$target
heart_class$X <- clean[,-4] # drop column 4, the target/outcome

8.3.1 Save the list heart_class

This way, you do not have to repeat all of these steps. You can just load("data/preprocessed/heart_class.RData")

names(heart_class)
##  [1] "data"       "outcome"    "covariates" "train_rows" "x_train"   
##  [6] "y_train"    "x_test"     "y_test"     "Y"          "X"
save(clean, heart_class, # save the clean dataset as well
     file = "data/preprocessed/heart_class.RData")

9 Again, wipe your global environment clean

# Then, load heart_class with the load function!
load("data/preprocessed/heart_class.RData")
names(heart_class)
##  [1] "data"       "outcome"    "covariates" "train_rows" "x_train"   
##  [6] "y_train"    "x_test"     "y_test"     "Y"          "X"
# or
str(heart_class)
## List of 10
##  $ data      :'data.frame':  303 obs. of  23 variables:
##   ..$ sex      : int [1:303] 1 1 0 1 0 1 0 1 1 1 ...
##   ..$ fbs      : int [1:303] 1 0 0 0 0 0 0 0 1 0 ...
##   ..$ exang    : int [1:303] 0 0 0 0 1 0 0 0 0 0 ...
##   ..$ target   : int [1:303] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ cp_1     : int [1:303] 0 0 1 1 0 0 1 1 0 0 ...
##   ..$ cp_2     : int [1:303] 0 1 0 0 0 0 0 0 1 1 ...
##   ..$ cp_3     : int [1:303] 1 0 0 0 0 0 0 0 0 0 ...
##   ..$ restecg_1: int [1:303] 0 1 0 1 1 1 0 1 1 1 ...
##   ..$ restecg_2: int [1:303] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ slope_1  : int [1:303] 0 0 0 0 0 1 1 0 0 0 ...
##   ..$ slope_2  : int [1:303] 0 0 1 1 1 0 0 1 1 1 ...
##   ..$ ca_1     : int [1:303] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ ca_2     : int [1:303] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ ca_3     : int [1:303] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ ca_4     : int [1:303] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ thal_1   : int [1:303] 1 0 0 0 0 1 0 0 0 0 ...
##   ..$ thal_2   : int [1:303] 0 1 1 1 1 0 1 0 0 1 ...
##   ..$ thal_3   : int [1:303] 0 0 0 0 0 0 0 1 1 0 ...
##   ..$ age      : num [1:303] 0.1117 0.0656 0.0727 0.0993 0.1011 ...
##   ..$ trestbps : num [1:303] 0.257 0.23 0.23 0.213 0.213 ...
##   ..$ chol     : num [1:303] 0.413 0.443 0.362 0.418 0.628 ...
##   ..$ thalach  : num [1:303] 0.266 0.332 0.305 0.316 0.289 ...
##   ..$ oldpeak  : num [1:303] 0.00408 0.00621 0.00248 0.00142 0.00106 ...
##  $ outcome   : chr "target"
##  $ covariates: chr [1:22] "sex" "fbs" "exang" "cp_1" ...
##  $ train_rows: int [1:213, 1] 1 2 4 5 6 9 10 15 16 17 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr "Resample1"
##  $ x_train   :'data.frame':  213 obs. of  22 variables:
##   ..$ sex      : int [1:213] 1 1 1 0 1 1 1 0 0 0 ...
##   ..$ fbs      : int [1:213] 1 0 0 0 0 1 0 1 0 0 ...
##   ..$ exang    : int [1:213] 0 0 0 1 0 0 0 0 0 0 ...
##   ..$ cp_1     : int [1:213] 0 0 1 0 0 0 0 0 0 0 ...
##   ..$ cp_2     : int [1:213] 0 1 0 0 0 1 1 0 1 1 ...
##   ..$ cp_3     : int [1:213] 1 0 0 0 0 0 0 1 0 0 ...
##   ..$ restecg_1: int [1:213] 0 1 1 1 1 1 1 0 1 1 ...
##   ..$ restecg_2: int [1:213] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ slope_1  : int [1:213] 0 0 0 0 1 0 0 0 1 0 ...
##   ..$ slope_2  : int [1:213] 0 0 1 1 0 1 1 1 0 1 ...
##   ..$ ca_1     : int [1:213] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ ca_2     : int [1:213] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ ca_3     : int [1:213] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ ca_4     : int [1:213] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ thal_1   : int [1:213] 1 0 0 0 1 0 0 0 0 0 ...
##   ..$ thal_2   : int [1:213] 0 1 1 1 0 0 1 1 1 1 ...
##   ..$ thal_3   : int [1:213] 0 0 0 0 0 1 0 0 0 0 ...
##   ..$ age      : num [1:213] 0.1117 0.0656 0.0993 0.1011 0.1011 ...
##   ..$ trestbps : num [1:213] 0.257 0.23 0.213 0.213 0.248 ...
##   ..$ chol     : num [1:213] 0.413 0.443 0.418 0.628 0.34 ...
##   ..$ thalach  : num [1:213] 0.266 0.332 0.316 0.289 0.262 ...
##   ..$ oldpeak  : num [1:213] 0.004078 0.006206 0.001418 0.001064 0.000709 ...
##  $ y_train   : int [1:213] 1 1 1 1 1 1 1 1 1 1 ...
##  $ x_test    :'data.frame':  90 obs. of  22 variables:
##   ..$ sex      : int [1:90] 0 0 1 1 0 1 1 0 1 0 ...
##   ..$ fbs      : int [1:90] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ exang    : int [1:90] 0 0 0 0 0 0 1 0 0 0 ...
##   ..$ cp_1     : int [1:90] 1 1 1 0 0 1 0 0 0 1 ...
##   ..$ cp_2     : int [1:90] 0 0 0 0 1 0 0 0 0 0 ...
##   ..$ cp_3     : int [1:90] 0 0 0 0 0 0 1 1 0 0 ...
##   ..$ restecg_1: int [1:90] 0 0 1 1 1 1 0 1 1 1 ...
##   ..$ restecg_2: int [1:90] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ slope_1  : int [1:90] 0 1 0 0 0 0 1 0 1 0 ...
##   ..$ slope_2  : int [1:90] 1 0 1 1 1 1 0 1 0 1 ...
##   ..$ ca_1     : int [1:90] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ ca_2     : int [1:90] 0 0 0 0 0 0 0 1 0 1 ...
##   ..$ ca_3     : int [1:90] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ ca_4     : int [1:90] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ thal_1   : int [1:90] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ thal_2   : int [1:90] 1 1 0 1 1 1 1 1 0 1 ...
##   ..$ thal_3   : int [1:90] 0 0 1 0 0 0 0 0 1 0 ...
##   ..$ age      : num [1:90] 0.0727 0.0993 0.078 0.0957 0.0851 ...
##   ..$ trestbps : num [1:90] 0.23 0.248 0.213 0.248 0.23 ...
##   ..$ chol     : num [1:90] 0.362 0.521 0.466 0.424 0.488 ...
##   ..$ thalach  : num [1:90] 0.305 0.271 0.307 0.284 0.246 ...
##   ..$ oldpeak  : num [1:90] 0.002482 0.002305 0 0.002128 0.000355 ...
##  $ y_test    : int [1:90] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Y         : int [1:303] 1 1 1 1 1 1 1 1 1 1 ...
##  $ X         :'data.frame':  303 obs. of  22 variables:
##   ..$ sex      : int [1:303] 1 1 0 1 0 1 0 1 1 1 ...
##   ..$ fbs      : int [1:303] 1 0 0 0 0 0 0 0 1 0 ...
##   ..$ exang    : int [1:303] 0 0 0 0 1 0 0 0 0 0 ...
##   ..$ cp_1     : int [1:303] 0 0 1 1 0 0 1 1 0 0 ...
##   ..$ cp_2     : int [1:303] 0 1 0 0 0 0 0 0 1 1 ...
##   ..$ cp_3     : int [1:303] 1 0 0 0 0 0 0 0 0 0 ...
##   ..$ restecg_1: int [1:303] 0 1 0 1 1 1 0 1 1 1 ...
##   ..$ restecg_2: int [1:303] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ slope_1  : int [1:303] 0 0 0 0 0 1 1 0 0 0 ...
##   ..$ slope_2  : int [1:303] 0 0 1 1 1 0 0 1 1 1 ...
##   ..$ ca_1     : int [1:303] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ ca_2     : int [1:303] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ ca_3     : int [1:303] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ ca_4     : int [1:303] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ thal_1   : int [1:303] 1 0 0 0 0 1 0 0 0 0 ...
##   ..$ thal_2   : int [1:303] 0 1 1 1 1 0 1 0 0 1 ...
##   ..$ thal_3   : int [1:303] 0 0 0 0 0 0 0 1 1 0 ...
##   ..$ age      : num [1:303] 0.1117 0.0656 0.0727 0.0993 0.1011 ...
##   ..$ trestbps : num [1:303] 0.257 0.23 0.23 0.213 0.213 ...
##   ..$ chol     : num [1:303] 0.413 0.443 0.362 0.418 0.628 ...
##   ..$ thalach  : num [1:303] 0.266 0.332 0.305 0.316 0.289 ...
##   ..$ oldpeak  : num [1:303] 0.00408 0.00621 0.00248 0.00142 0.00106 ...

10 Fit single models

Fitting a single model first is a good way to prototype your machine learning task.

10.1 View available models

SuperLearner supports a wide variety of useful algorithms:

listWrappers()
## All prediction algorithm wrappers in SuperLearner:
##  [1] "SL.bartMachine"      "SL.bayesglm"         "SL.biglasso"        
##  [4] "SL.caret"            "SL.caret.rpart"      "SL.cforest"         
##  [7] "SL.earth"            "SL.extraTrees"       "SL.gam"             
## [10] "SL.gbm"              "SL.glm"              "SL.glm.interaction" 
## [13] "SL.glmnet"           "SL.ipredbagg"        "SL.kernelKnn"       
## [16] "SL.knn"              "SL.ksvm"             "SL.lda"             
## [19] "SL.leekasso"         "SL.lm"               "SL.loess"           
## [22] "SL.logreg"           "SL.mean"             "SL.nnet"            
## [25] "SL.nnls"             "SL.polymars"         "SL.qda"             
## [28] "SL.randomForest"     "SL.ranger"           "SL.ridge"           
## [31] "SL.rpart"            "SL.rpartPrune"       "SL.speedglm"        
## [34] "SL.speedlm"          "SL.step"             "SL.step.forward"    
## [37] "SL.step.interaction" "SL.stepAIC"          "SL.svm"             
## [40] "SL.template"         "SL.xgboost"
## 
## All screening algorithm wrappers in SuperLearner:
## [1] "All"
## [1] "screen.corP"           "screen.corRank"        "screen.glmnet"        
## [4] "screen.randomForest"   "screen.SIS"            "screen.template"      
## [7] "screen.ttest"          "write.screen.template"

Let’s fit three algorithms, one at a time:

  1. Single decision tree

  2. Random forest

  3. Lasso

10.2 Decision tree

The decision tree algorithm can be useful because it is relatively simple to construct and interpret. Decision trees can also work with categorical data directly, without the need for one-hot encoding.

Decision trees work by recursive partitioning: the feature space is repeatedly divided into smaller regions that contain fewer and fewer observations. Splitting takes place at nodes, and each directional split is called a branch. The top node is called the root node; it holds the most discriminating split, so the strongest partition always comes first. The tree stops partitioning when it fails to meet some hyperparameter condition. See ?rpart.control for a list of tuning options.

Keep in mind that single decision trees are prone to overfitting and to high variance: small changes in the training data can produce very different trees.

Uncomment the code below to reproduce the tree figure for the training data.

?rpart
?rpart.control
# install.packages("rpart.plot")
# library(rpart.plot)
# (dt_example_plot <- rpart(heart_class$y_train ~ .,
#                           data = heart_class$x_train,
#                           method = "class",
#                           parms = list(split = "information")))
# rpart.plot(dt_example_plot)

training set decision tree plot

10.2.1 Train the model with SuperLearner

set.seed(1)
dt <- SuperLearner(Y = heart_class$y_train, 
                   X = heart_class$x_train, 
                   family = binomial(),
                   SL.library = "SL.rpart")
dt
## 
## Call:  
## SuperLearner(Y = heart_class$y_train, X = heart_class$x_train, family = binomial(),  
##     SL.library = "SL.rpart") 
## 
## 
##                   Risk Coef
## SL.rpart_All 0.1743429    1
# Accuracy
(dt_acc <- 1 - dt$cvRisk)
## SL.rpart_All 
##    0.8256571

10.2.2 Test set AUC-ROC:

# Predict on test set
dt_predictions = predict(dt, heart_class$x_test, onlySL = TRUE)
str(dt_predictions)
## List of 2
##  $ pred           : num [1:90, 1] 0.917 0.917 0.652 0.917 0.917 ...
##  $ library.predict: num [1:90, 1] 0.917 0.917 0.652 0.917 0.917 ...
summary(dt_predictions$library.predict)
##        V1         
##  Min.   :0.09091  
##  1st Qu.:0.11111  
##  Median :0.65217  
##  Mean   :0.59937  
##  3rd Qu.:0.91667  
##  Max.   :0.91667
# Visualize predicted probabilities
qplot(dt_predictions$pred[, 1]) + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Compute AUC-ROC and save as variable
dt_rocr_score = ROCR::prediction(dt_predictions$pred, heart_class$y_test)
dt_auc_roc = ROCR::performance(dt_rocr_score, measure = "auc", x.measure = "cutoff")@y.values[[1]]

# AUC-ROC
dt_auc_roc
## [1] 0.7366849

10.3 Random forest

The random forest algorithm is a big step up from a single decision tree and is itself an ensemble method.

It is a forest because it grows multiple decision trees.

It is random because each tree is fit to a bootstrap sample of the dataset (drawn with replacement, so roughly two-thirds of the unique observations appear in each sample); this is called bootstrap aggregating, or “bagging”. Performance is evaluated on the remaining roughly one-third of observations (the “out-of-bag” sample) and averaged across the trees.

It is also random because, instead of always trying the single most discriminating feature, it considers only a random sample of features at each split. This decorrelates the trees, so averaging their predictions yields a more robust estimator.

See ?ranger for hyperparameter tuning options.
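The bagging and out-of-bag mechanics described above can be seen by fitting ranger directly. This is a minimal sketch, not part of the SuperLearner workflow; it assumes the heart_class object from earlier is in memory and that the ranger package is installed, and the hyperparameter values shown are illustrative defaults:

```r
library(ranger)

set.seed(1)
rf_direct <- ranger(
  x = heart_class$x_train,                        # features
  y = factor(heart_class$y_train),                # outcome as a factor
  num.trees = 500,                                # trees in the forest
  mtry = floor(sqrt(ncol(heart_class$x_train))),  # features tried per split
  probability = TRUE                              # return class probabilities
)

# OOB prediction error (the Brier score for probability forests),
# estimated on the roughly one-third of observations each tree never saw
rf_direct$prediction.error
```

Because the OOB error is computed on held-out observations, it gives an honest internal performance estimate without a separate validation split.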

?ranger

10.3.1 Train the model

set.seed(1)
rf <- SuperLearner(Y = heart_class$y_train, 
                   X = heart_class$x_train, 
                   family = binomial(),
                   SL.library = "SL.ranger")
rf
## 
## Call:  
## SuperLearner(Y = heart_class$y_train, X = heart_class$x_train, family = binomial(),  
##     SL.library = "SL.ranger") 
## 
## 
##                    Risk Coef
## SL.ranger_All 0.1293912    1
# Accuracy
(rf_acc <- 1 - rf$cvRisk)
## SL.ranger_All 
##     0.8706088

10.3.2 Test set AUC-ROC:

rf_predictions = predict(rf, heart_class$x_test, onlySL = TRUE)
str(rf_predictions)
## List of 2
##  $ pred           : num [1:90, 1] 0.97 0.762 0.566 0.901 0.941 ...
##  $ library.predict: num [1:90, 1] 0.97 0.762 0.566 0.901 0.941 ...
summary(rf_predictions$library.predict)
##        V1          
##  Min.   :0.008188  
##  1st Qu.:0.337193  
##  Median :0.619751  
##  Mean   :0.592442  
##  3rd Qu.:0.874289  
##  Max.   :0.992931
# Visualize predicted probabilities
qplot(rf_predictions$pred[, 1]) + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Compute AUC-ROC and save as variable
rf_rocr_score = ROCR::prediction(rf_predictions$pred, heart_class$y_test)
rf_auc_roc = ROCR::performance(rf_rocr_score, measure = "auc", x.measure = "cutoff")@y.values[[1]]

# AUC-ROC 
rf_auc_roc
## [1] 0.8596317

10.4 Lasso

The lasso (least absolute shrinkage and selection operator) is a form of penalized regression and classification that shrinks toward zero, and can set exactly to zero, the coefficients of features that are not related to the outcome variable.

We can follow the “one standard error rule”: select a larger lambda value (the regularization parameter) within one standard error of the error-minimizing value, sacrificing a little accuracy for a model that contains fewer variables and is ostensibly easier to interpret.

View the help files with ?glmnet and ?cv.glmnet to learn more.

?glmnet
?cv.glmnet
set.seed(1)
lasso <- cv.glmnet(x = as.matrix(heart_class$X), 
                   y = heart_class$Y, 
                   family = "binomial", 
                   alpha = 1)
lasso
## 
## Call:  cv.glmnet(x = as.matrix(heart_class$X), y = heart_class$Y, family = "binomial",      alpha = 1) 
## 
## Measure: Binomial Deviance 
## 
##      Lambda Index Measure      SE Nonzero
## min 0.01012    36  0.7631 0.06686      18
## 1se 0.03392    23  0.8285 0.04659      15
# View path plot
plot(lasso)

# View coefficients
plot(lasso$glmnet.fit, xvar = "lambda", label = TRUE)

# Show coefficients for the 1se model
(coef_1se = coef(lasso, s = "lambda.1se"))
## 23 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)   -0.5608427
## sex           -0.4290884
## fbs            .        
## exang         -0.6548822
## cp_1           0.1442183
## cp_2           0.6982435
## cp_3           0.3563876
## restecg_1      0.1032213
## restecg_2      .        
## slope_1       -0.3557496
## slope_2        0.1325356
## ca_1          -0.8761617
## ca_2          -1.2264879
## ca_3          -0.7966503
## ca_4           .        
## thal_1         .        
## thal_2         0.6554511
## thal_3        -0.5795217
## age            .        
## trestbps       .        
## chol           .        
## thalach        6.1897480
## oldpeak     -165.2682983

10.4.1 Lasso - train the model with SuperLearner

set.seed(1)
la <- SuperLearner(Y = heart_class$y_train, 
                   X = heart_class$x_train, 
                   family = binomial(),
                   SL.library = "SL.glmnet")
la
## 
## Call:  
## SuperLearner(Y = heart_class$y_train, X = heart_class$x_train, family = binomial(),  
##     SL.library = "SL.glmnet") 
## 
## 
##                    Risk Coef
## SL.glmnet_All 0.1178595    1
# Accuracy
(la_acc <- 1 - la$cvRisk)
## SL.glmnet_All 
##     0.8821405

10.4.2 Lasso test set AUC-ROC:

la_predictions = predict(la, heart_class$x_test, onlySL = TRUE)
str(la_predictions)
## List of 2
##  $ pred           : num [1:90, 1] 0.916 0.74 0.749 0.859 0.965 ...
##  $ library.predict: num [1:90, 1] 0.916 0.74 0.749 0.859 0.965 ...
summary(la_predictions$library.predict)
##        V1          
##  Min.   :0.008801  
##  1st Qu.:0.299391  
##  Median :0.633026  
##  Mean   :0.580456  
##  3rd Qu.:0.903609  
##  Max.   :0.988807
# Visualize predicted probabilities
qplot(la_predictions$pred[, 1]) + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Compute AUC-ROC and save as variable
la_rocr_score = ROCR::prediction(la_predictions$pred, heart_class$y_test)
la_auc_roc = ROCR::performance(la_rocr_score, measure = "auc", x.measure = "cutoff")@y.values[[1]]

# AUC-ROC 
la_auc_roc
## [1] 0.8939771

11 Ensemble

We have now tried three algorithms: the decision tree performed relatively poorly compared to the random forest and the lasso. So how do we know which algorithm to choose? Rather than picking just one, let’s fit an ensemble of all three plus two more algorithms:

  1. xgboost

  2. The mean of Y

11.1 Summary of algorithm definitions

- SL.rpart: single decision tree
- SL.ranger: random forest
- SL.glmnet: lasso
- SL.xgboost: boosted trees
- SL.mean: the mean of Y (a naive benchmark any useful model should beat)

11.2 Train the ensemble of models

set.seed(1)
system.time({
cv_sl <- CV.SuperLearner(Y = heart_class$Y,  # heart disease yes/no
                         X = heart_class$X,  # excluding the "target" variable
                         family = binomial(),
                         # For your own research, 5, 10, or 20 are good options
                         cvControl = list(V = 5), 
                         innerCvControl = list(list(V = 5)),
                         verbose = FALSE, 
                         method = "method.NNLS",
                         SL.library = c("SL.rpart",   # decision tree
                                        "SL.ranger",  # random forest
                                        "SL.glmnet",  # lasso
                                        "SL.xgboost", # boosted trees
                                        "SL.mean"))   # Y mean
})
## Warning in CV.SuperLearner(Y = heart_class$Y, X = heart_class$X, family =
## binomial(), : Only a single innerCvControl is given, will be replicated across
## all cross-validation split calls to SuperLearner
##    user  system elapsed 
##  14.819   0.465  14.893

11.3 Explore the output

# View the function call
cv_sl
## 
## Call:  
## CV.SuperLearner(Y = heart_class$Y, X = heart_class$X, family = binomial(),  
##     SL.library = c("SL.rpart", "SL.ranger", "SL.glmnet", "SL.xgboost", "SL.mean"),  
##     method = "method.NNLS", verbose = FALSE, cvControl = list(V = 5), innerCvControl = list(list(V = 5))) 
## 
## 
## 
## Cross-validated predictions from the SuperLearner:  SL.predict 
## 
## Cross-validated predictions from the discrete super learner (cross-validation selector):  discreteSL.predict 
## 
## Which library algorithm was the discrete super learner:  whichDiscreteSL 
## 
## Cross-validated prediction for all algorithms in the library:  library.predict
# View the risk table
summary(cv_sl)
## 
## Call:  
## CV.SuperLearner(Y = heart_class$Y, X = heart_class$X, family = binomial(),  
##     SL.library = c("SL.rpart", "SL.ranger", "SL.glmnet", "SL.xgboost", "SL.mean"),  
##     method = "method.NNLS", verbose = FALSE, cvControl = list(V = 5), innerCvControl = list(list(V = 5))) 
## 
## 
## Risk is based on: Mean Squared Error
## 
## All risk estimates are based on V =  5 
## 
##       Algorithm     Ave        se      Min     Max
##   Super Learner 0.11722 0.0113943 0.088134 0.15460
##     Discrete SL 0.11607 0.0116867 0.084784 0.15543
##    SL.rpart_All 0.18315 0.0167049 0.142558 0.23328
##   SL.ranger_All 0.13734 0.0105342 0.122008 0.16847
##   SL.glmnet_All 0.11607 0.0116867 0.084784 0.15543
##  SL.xgboost_All 0.14758 0.0139655 0.110688 0.19415
##     SL.mean_All 0.24944 0.0026639 0.244474 0.25555
# View the discrete winner
table(simplify2array(cv_sl$whichDiscreteSL))
## 
## SL.glmnet_All 
##             5
# View the AUC-ROC table for: 
# 1. The individual algorithms
# 2. The discrete winner
# 3. The SuperLearner ensemble
ck37r::auc_table(cv_sl)
##                      auc         se  ci_lower  ci_upper      p-value
## SL.mean_All    0.5000000 0.05767758 0.3869540 0.6130460 4.666398e-13
## SL.rpart_All   0.7753934 0.03589421 0.7050420 0.8457447 7.213063e-05
## SL.xgboost_All 0.8657012 0.02082595 0.8248831 0.9065194 1.340076e-02
## SL.ranger_All  0.8851278 0.01857296 0.8487255 0.9215302 7.535160e-02
## SuperLearner   0.9100506 0.01667676 0.8773648 0.9427365 4.577962e-01
## SL.glmnet_All  0.9118182 0.01638189 0.8797102 0.9439261 5.000000e-01
## DiscreteSL     0.9118182 0.01638189 0.8797102 0.9439261 5.000000e-01
# Visualize the cross-validated risk
plot.CV.SuperLearner(cv_sl) + theme_bw()

# Print the weights table
cvsl_weights(cv_sl)
##   # Learner    Mean      SD     Min     Max
## 1 1  glmnet 0.84904 0.08951 0.77596 1.00000
## 2 2  ranger 0.13957 0.08266 0.00000 0.22044
## 3 3   rpart 0.00811 0.01620 0.00000 0.03695
## 4 4 xgboost 0.00328 0.00733 0.00000 0.01638
## 5 5    mean 0.00000 0.00000 0.00000 0.00000
# Compute ensemble AUC-ROC 
ck37r::cvsl_auc(cv_sl)
## $cvAUC
## [1] 0.9100506
## 
## $se
## [1] 0.01667676
## 
## $ci
## [1] 0.8773648 0.9427365
## 
## $confidence
## [1] 0.95
# Plot AUC-ROC curve
ck37r::plot_roc(cv_sl)

# View rough estimates of variable importance
set.seed(1)
var_importance <- ck37r::vim_corr(covariates = heart_class$covariates, 
                                  outcome = heart_class$outcome, 
                                  data = clean, 
                                  bootse = FALSE, 
                                  verbose = TRUE)
var_importance
##     rank  variable        corr      p_value  p_value_fdr     avg_con
## 015    1    thal_2  0.52733355 4.356225e-23 9.583695e-22 0.260869565
## 016    2    thal_3 -0.48611215 2.241269e-19 2.465396e-18 0.644927536
## 02     3     exang -0.43675708 1.520814e-15 1.115263e-14 0.550724638
## 021    4   oldpeak -0.43069600 4.085346e-15 2.246941e-14 0.002811183
## 020    5   thalach  0.42174093 1.697338e-14 7.468286e-14 0.246633775
## 09     6   slope_2  0.39406637 1.068761e-12 3.918791e-12 0.253623188
## 08     7   slope_1 -0.36205330 8.142433e-11 2.559051e-10 0.659420290
## 04     8      cp_2  0.31674216 1.735460e-08 4.772515e-08 0.130434783
## 0      9       sex -0.28093658 6.678692e-07 1.632569e-06 0.826086957
## 011   10      ca_2 -0.27399787 1.280668e-06 2.817470e-06 0.224637681
## 03    11      cp_1  0.24587910 1.498284e-05 2.996568e-05 0.065217391
## 010   12      ca_1 -0.23241224 4.409069e-05 8.083294e-05 0.318840580
## 017   13       age -0.22543872 7.524801e-05 1.273428e-04 0.100357180
## 012   14      ca_3 -0.21061527 2.220536e-04 3.489414e-04 0.123188406
## 06    15 restecg_1  0.17532180 2.191648e-03 3.214418e-03 0.405797101
## 018   16  trestbps -0.14493113 1.154606e-02 1.587583e-02 0.238295303
## 014   17    thal_1 -0.10658897 6.388288e-02 8.267196e-02 0.086956522
## 05    18      cp_3  0.08695687 1.309784e-01 1.600847e-01 0.050724638
## 019   19      chol -0.08523911 1.387903e-01 1.607046e-01 0.445189639
## 07    20 restecg_2 -0.06841024 2.351175e-01 2.586292e-01 0.021739130
## 013   21      ca_4  0.06644102 2.488996e-01 2.607520e-01 0.007246377
## 01    22       fbs -0.02804576 6.267775e-01 6.267775e-01 0.159420290
##        avg_case note
## 015 0.787878788     
## 016 0.169696970     
## 02  0.139393939     
## 021 0.001033742     
## 020 0.280969267     
## 09  0.648484848     
## 08  0.296969697     
## 04  0.418181818     
## 0   0.563636364     
## 011 0.042424242     
## 03  0.248484848     
## 010 0.127272727     
## 017 0.093079734     
## 012 0.018181818     
## 06  0.581818182     
## 018 0.229260692     
## 014 0.036363636     
## 05  0.096969697     
## 019 0.429486353     
## 07  0.006060606     
## 013 0.024242424     
## 01  0.139393939

12 Challenge

12.1 Customize model hyperparameters and refit the ensemble

Read Chapter 9 and Chapter 10 of the Guide to SuperLearner to learn how to customize model hyperparameters and test these same algorithms with multiple hyperparameter settings. Then refit the ensemble, adding a few tuned algorithms alongside the defaults already passed to the SL.library argument of CV.SuperLearner above (“SL.rpart”, “SL.ranger”, “SL.glmnet”, “SL.xgboost”, “SL.mean”).

What changed, and how were you able to optimize performance?

How did you find out which hyperparameters to tune?

HINT: type ?rpart, ?ranger, etc.

Also, consider the following points to boost performance:

- Change nominal factors to ordinal type when appropriate
- Explore other ways to scale numeric variables
- Include other algorithms
- Utilize other/more performance metrics
- Read more
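As a starting point for the challenge, here is a hedged sketch of hyperparameter customization with SuperLearner::create.Learner(), as described in the Guide to SuperLearner. The tuning values below are illustrative, not recommendations:

```r
library(SuperLearner)

# Try two mtry values for the random forest; learner names
# (e.g. "SL.ranger_1", "SL.ranger_2") are auto-generated
ranger_tuned <- create.Learner("SL.ranger",
                               tune = list(mtry = c(2, 5)))

# A more flexible decision tree via rpart's cp (complexity) parameter
rpart_tuned <- create.Learner("SL.rpart",
                              params = list(cp = 0.005))

# Combine default and tuned learners into one library
custom_library <- c("SL.glmnet", ranger_tuned$names, rpart_tuned$names)
custom_library
```

You can then pass custom_library to the SL.library argument of CV.SuperLearner and compare the risk table against the default fit.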

14 Part 2 - Image classification with Keras in R

(coming soon)